source("src/01-import-de-pt.R", local = knitr::knit_global())
source("src/data-cleaning.R", local = knitr::knit_global())
de.imported <- de # keep a copy around for debugging
pt <- clean_pt_variables(pt)
# create a smaller presidents table just for the study period
studyperiod.start <- as.Date("1982-10-10")
studyperiod.end <- Sys.Date()
pt.study <- pt %>% filter((first_day >= studyperiod.start) & (last_day <= studyperiod.end))
One aspect built into our dataset is location information. Five columns in the Google Sheet, and equivalently five variables in R, record this information.
| address | community | municipality | province | department |
|---|---|---|---|---|
| Address/Intersection/Place | Community/Neighborhood | Municipality | Province | Department |
| Dirección/Intersección/Lugar | Comunidad/Barrio | Municipalidad | Provincia | Departamento |
This information can be used in tables, or as axes in visualization, but the most effective way to look at this information is on maps. Visualizing Municipality as a region on a choropleth map is the focus of another subproject, but in this document, we’ll look at how to generate points on a map from this information. Associating data with geographic location is called geocoding the data.
Ultimately, we’d like to be able to generate point-based visualizations of deaths in the dataset, but this will require both cleaning up the detailed location information we have and being realistic about handling the many deaths where we don’t have full information.
As of this writing we have address information for 261 deaths, community or neighborhood information for `sum(!is.na(de$community))` deaths, and municipality information for 574 deaths, out of a total of 644 entries. This means that we'll have to use community, neighborhood, or municipality as a proxy for hundreds of locations.
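The per-column coverage counts above can be computed with `colSums()` over the location columns. Here is a sketch against a small stand-in data frame (in practice we would run this on the real `de` table):

```r
# Stand-in for the real `de` table, just to illustrate the coverage count
de_demo <- data.frame(
  address      = c("Plaza Murillo", NA, NA),
  community    = c(NA, "Villa Fátima", NA),
  municipality = c("La Paz", "La Paz", "El Alto")
)
# One count per location column: how many rows have that field filled in
coverage <- colSums(!is.na(de_demo[c("address", "community", "municipality")]))
coverage  # address: 1, community: 1, municipality: 3
```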
Here’s an example, generated using Google Pivot Tables, of how geocoding can power information-rich mapping:
Google Pivot Tables is now defunct, and we’d like to generate this kind of visualization using open data tools that can be embedded in RMarkdown scripts. To do this, we need to geocode our location data ourselves and then map the results.
Geocoding requires performing look-ups to online databases, which requires a network connection, can be a limited resource (databases either restrict the frequency of requests or charge for extended usage), and involves two kinds of error correction, either finessing our location data so it can be found or finessing the outputs in ambiguous cases. Accordingly, our desired workflow will gather all the location information, run it through the look-up process in one batch, and save the location data obtained for future use. We’ll also need to either run the entire batch again when new locations are added, or create a tool that looks up only new locations.
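The batch-and-cache workflow just described can be sketched as a small helper. This is a sketch only: `lookup_one` stands in for a real geocoder call (such as ggmap's `geocode()` or tmaptools' `geocode_OSM()`), and the cache path is an assumption.

```r
cache_file <- "data/geocode-cache.RDS"  # assumed location for the saved results

# Look up only locations not already in the on-disk cache, then persist
geocode_new_only <- function(locations, lookup_one) {
  cache <- if (file.exists(cache_file)) readRDS(cache_file) else
    data.frame(location = character(), lat = numeric(), lon = numeric())
  todo <- setdiff(locations, cache$location)
  if (length(todo) > 0) {
    fresh <- do.call(rbind, lapply(todo, lookup_one))
    cache <- rbind(cache, fresh)
    saveRDS(cache, cache_file)  # save for future runs
  }
  cache[cache$location %in% locations, ]
}
```

On the first run everything is looked up and saved; on later runs only locations absent from the cache trigger network requests.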
On the other hand, mapping using location data is relatively straightforward. The key challenge may be keeping track of data points with widely varying levels of precision. We should limit how far the map can be zoomed in, since all the deaths at the “La Paz, Murillo, La Paz” point (referring to the capital city) will appear to be on a single street corner in a single neighborhood in that city.
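One possible way to cap zoom is to build the map with the leaflet package directly rather than through mapview, using `leafletOptions(maxZoom = ...)`; the coordinates below are illustrative:

```r
library(leaflet)

# Cap zoom so municipality-level points don't imply street-level precision
m <- leaflet(options = leafletOptions(maxZoom = 12)) %>%
  addTiles() %>%
  addCircleMarkers(lng = -68.1193, lat = -16.4897,
                   label = "La Paz, Murillo, La Paz")
```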
We’re going to narrow our dataset down to just event_title, date, and location data, and then create search strings.
To make our dataset simpler for looking through location entries, we’ll simplify dates and put them on the left of the data. We’ll also combine location data together into long search strings for municipalities and for addresses.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
de.dated <- de %>%
mutate(
date = (paste(year, month, day, sep="-")
%>% ymd() %>% as.Date())) %>%
mutate(date_name = format(date, format="%B %d")) %>%
mutate(year = NULL, month = NULL, day=NULL) %>%
relocate(event_title, date)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `date = (paste(year, month, day, sep = "-") %>% ymd() %>%
## as.Date())`.
## Caused by warning:
## ! 15 failed to parse.
de.municipality <- de.dated %>%
mutate(municipality = str_c(municipality, "Municipality", sep=" ")) %>%
unite("location", municipality:department, sep=", ",
na.rm=TRUE, remove=FALSE) %>% select(event_title, date, location)
de.municipality.dist <- de.municipality %>% arrange(location) %>% distinct(location)
library(reactable) # interactive tables; may already be loaded upstream
reactable(de.municipality)
# Simplify down to a list of unique municipalities, prepped for bulk geocoding
municipalities <- str_unique(de.municipality$location)
municipalities_df <- data.frame(municipality = municipalities, stringsAsFactors = FALSE)
# Process the longer addresses in the same way
de.address <- de %>% filter((!is.na(address))&(!is.na(community))) %>%
mutate(community=NULL) %>%
unite("location", address:department, sep=", ",
na.rm=TRUE, remove=FALSE) %>% select(event_title, location)
de.address.dist <- de.address %>% arrange(location) %>% distinct(location)
reactable(de.address)
addresses <- str_unique(de.address$location)
addresses_df <- data.frame(address = addresses, stringsAsFactors = FALSE)
Currently our address data is multipurpose. I suggest we revise the location columns as follows:
| place | address | community | municipality | province | department |
|---|---|---|---|---|---|
| Place | Address/Intersection | Community/Neighborhood | Municipality | Province | Department |
| Lugar | Dirección/Intersección | Comunidad/Barrio | Municipalidad | Provincia | Departamento |
… where place includes any descriptive information, and address is formatted, and checked, so that geocoding look-up services understand it. Because changes to the column structure of the Google Sheet require a change to the import code, we’ll make all changes at one time this semester.
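Assuming the revised columns above, a geocoder search string could be assembled by skipping empty fields; the helper name and sample entry below are illustrative:

```r
# Build a geocoder query from the proposed columns, dropping NA fields
make_search_string <- function(row) {
  parts <- c(row$place, row$address, row$community,
             row$municipality, row$province, row$department, "Bolivia")
  paste(parts[!is.na(parts)], collapse = ", ")
}

entry <- list(place = NA, address = "Plaza 14 de Septiembre", community = NA,
              municipality = "Cochabamba", province = "Cercado",
              department = "Cochabamba")
make_search_string(entry)
# "Plaza 14 de Septiembre, Cochabamba, Cercado, Cochabamba, Bolivia"
```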
We have two main geocoding options…
The Google Places API translates between a search-bar string and a place record. Its input string format is relatively open, while its output place format is relatively elaborate.
Install this package using `install.packages("ggmap")`, then load it with `library(ggmap)`.
Register your API key with `register_google(key = "--insert--key--here--")` once per R session.
Sample geocoding call:
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
# Insert register_google() call here
loc.villatunari <- "Villa Tunari, Cochabamba, Bolivia"
# geocode described here: https://www.rdocumentation.org/packages/ggmap/versions/3.0.1/topics/geocode
gc.villatunari <- geocode(
loc.villatunari,
output = "latlon",
urlonly = FALSE,
override_limit = FALSE,
region = "bo"
)
## ℹ <https://maps.googleapis.com/maps/api/geocode/json?address=Villa+Tunari,+Cochabamba,+Bolivia&key=xxx>
Some tips:

- Adding `region="bo"` should direct our searches to Bolivia.
- We could try `location = "<municipality text>"` to see if it improves output for address searches.
- “Santa Cruz” on its own ends up in California; “Potosí” seems to prefer San Luis Potosí in Mexico. Will check if the regional settings help for these.
Note: Google’s API will bill for search requests, at $5 per 1000 queries, but provides the first $200 annually for free. We have an API key to make this happen, which for obvious reasons should not be posted outside our private repository, and must be scrubbed before we make the repository public.
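As a sanity check on the rates just quoted, here is a quick back-of-envelope helper (the function name is ours):

```r
# Net annual cost given $5 per 1000 queries and a $200/year free credit
query_cost <- function(n_queries, rate_per_1000 = 5, free_credit = 200) {
  max(0, n_queries * rate_per_1000 / 1000 - free_credit)
}

query_cost(644)    # one query per current entry: 0 (well within the free tier)
query_cost(50000)  # 50k queries: $250 gross, $50 after the credit
```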
Install packages using `install.packages("tmaptools")` and `install.packages("tmap")`.
Registration isn’t required, but bulk calls are discouraged and the API asks for at most one request per second. It’s unclear whether `geocode_OSM()` automatically throttles requests to this rate.
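If `geocode_OSM()` turns out not to pace itself, a minimal manual throttle might look like this sketch, where `lookup` stands in for any single-query geocoding call:

```r
# Space out requests to respect a one-per-second rate limit
geocode_throttled <- function(queries, lookup, delay = 1) {
  results <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    if (i > 1) Sys.sleep(delay)  # pause between requests, not before the first
    results[[i]] <- tryCatch(lookup(queries[i]), error = function(e) NULL)
  }
  results
}
```

Failed look-ups come back as `NULL` rather than aborting the whole batch.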
library(tmaptools)
library(tmap)
bolivia.bb <- bb(x = "Bolivia")
villatunari <- geocode_OSM(q="Villa Tunari, Cochabamba, Bolivia")
villatunari
## $query
## [1] "Villa Tunari, Cochabamba, Bolivia"
##
## $coords
## x y
## -65.42003 -16.97255
##
## $bbox
## xmin ymin xmax ymax
## -66.33297 -17.31209 -64.72244 -15.78921
huatajata <- geocode_OSM(q="Huatajata, Omasuyos, La Paz", geometry = "point")
huatajata
## $query
## [1] "Huatajata, Omasuyos, La Paz"
##
## $coords
## x y
## -68.70049 -16.20886
##
## $bbox
## xmin ymin xmax ymax
## -68.72049 -16.22886 -68.68049 -16.18886
geocode_OSM(q="Huatajata, Omasuyos, La Paz", geometry = "point",
details = FALSE, as.data.frame = TRUE)
## query lat lon lat_min lat_max lon_min
## 1 Huatajata, Omasuyos, La Paz -16.20886 -68.70049 -16.22886 -16.18886 -68.72049
## lon_max
## 1 -68.68049
Here’s the call to geocode in bulk, then save the results.
locations_df <- mutate_geocode(municipalities_df, location = municipality)
saveRDS(locations_df, file = "data/municipalities-table-gg.RDS", compress = FALSE)
When we’re not doing the first run, we load the results and then join them to the deaths table.
locations_df <- readRDS(file="data/municipalities-table-gg.RDS")
de.municipality <- dplyr::rename(de.municipality, municipality = location)
de.municipality.located <- left_join(de.municipality, locations_df)
## Joining with `by = join_by(municipality)`
reactable(de.municipality.located)
Finally, let’s map them…
library(sf)       # st_as_sf(); may already be loaded upstream
library(mapview)  # interactive maps
locations <- as_tibble(de.municipality.located)
locations <- locations %>% filter(!is.na(lat) & !is.na(lon))
locations_sf <- st_as_sf(locations, coords = c("lon", "lat"), crs = 4326)
mapview(locations_sf, legend=FALSE)
locations_a_df <- mutate_geocode(addresses_df, location = address)
saveRDS(locations_a_df, file = "data/addresses-table-gg.RDS", compress = FALSE)
locations_a_df <- readRDS(file = "data/addresses-table-gg.RDS")
locations_a <- as_tibble(locations_a_df)
locations_a <- locations_a %>% filter(!is.na(lat) & !is.na(lon))
locations_a_sf <- st_as_sf(locations_a, coords = c("lon", "lat"), crs = 4326)
mapview(locations_a_sf, legend=FALSE)
We’ll want a slightly different search string without the “Municipality” inserted. Otherwise this is the same as above.
de.municipality <- de.dated %>%
unite("location", municipality:department, sep=", ",
na.rm=TRUE, remove=FALSE) %>% select(event_title, date, location)
de.municipality.dist <- de.municipality %>% arrange(location) %>% distinct(location)
municipalities <- str_unique(de.municipality$location)
municipalities_df <- data.frame(municipality = municipalities, stringsAsFactors = FALSE)
muni10 <- head(municipalities_df, 10) # only process the first 10
nominatim_loc_geo <- geocode_OSM(muni10$municipality,
details = FALSE, as.data.frame = TRUE)
## No results found for "San Ignacio, José Miguel de Velasco, Santa Cruz".
# saveRDS(nominatim_loc_geo, file = "data/municipalities-table-osm.RDS", compress = FALSE)
locations_osm <- as_tibble(nominatim_loc_geo)
locations_osm <- locations_osm %>% filter(!is.na(lat) & !is.na(lon))
locations_osm_sf <- st_as_sf(locations_osm, coords = c("lon", "lat"), crs = 4326)
mapview(locations_osm_sf, legend=FALSE)
Note that we’re just doing 10 points for now, but we’ll geocode the whole batch eventually, and then save it to disk as well.
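One way to handle look-up failures like the “San Ignacio, José Miguel de Velasco, Santa Cruz” miss above is to retry with a simplified query, for example dropping the middle (province) component. A sketch, with an invented helper name:

```r
# Fall back from "place, province, department" to "place, department"
simplify_query <- function(q) {
  parts <- trimws(strsplit(q, ",")[[1]])
  if (length(parts) < 3) return(q)  # nothing to drop
  paste(c(parts[1], parts[length(parts)]), collapse = ", ")
}

simplify_query("San Ignacio, José Miguel de Velasco, Santa Cruz")
# "San Ignacio, Santa Cruz"
```

A retry loop could call `geocode_OSM()` on the simplified string whenever the full string returns no results.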
As I mentioned above: Geocoding … involves two kinds of error correction, either finessing our location data so it can be found or finessing the outputs in ambiguous cases. Accordingly our desired workflow will gather all the location information, run it through the look-up process in one batch, and save the location data obtained for future use.
Right now we have several tasks to make this run more smoothly, with Nathan in the lead.
And we can `left_join` these coordinates back to the original dataset and get to work on presenting information on the map. Mihir, if you’re up for experimenting with this, that would be great.
We’d like `de.municipality.located` put into `mapview()` and its relevant columns included in tooltips.

P.S. I, Carwil, will bring this roadmap into GitHub, and then work out how to make it into issues for you all to work through.